ORLA 6541 - Exercise 1
The following code and write-up contain answers to in-text exercises in Wickham & Grolemund (2017), Chapters 3-6, and Bulut & Desjardins (2019), Chapter 4.
A blank grey square
library(tidyverse)
ggplot(data=mpg)
234 rows and 11 columns
mpg
## # A tibble: 234 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
## 8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
## 9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
## 10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
## # … with 224 more rows
The drv variable describes the type of drive train for each car in the data set, where f = front-wheel drive, r = rear-wheel drive, 4 = four-wheel drive.
?mpg
ggplot(data = mpg)+geom_point(mapping=aes(x=cyl, y=hwy))
Since both variables are categorical, the points align based on the intersection of both categories, which does not provide any useful information. Scatterplots are more useful for mapping the relationship between two continuous variables.
ggplot(data = mpg) +
geom_point(mapping = aes(x = class, y = drv))
3.3.1 Exercises
1. What's gone wrong with this code? Why are the points not blue?
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
aes() is used to map an aesthetic to a variable. In this case, we wanted to set the color aesthetic to the literal value "blue". Since "blue" is not a variable, we cannot map it inside aes(); instead, we have to set the aesthetic property of our geom manually, outside of aes(). As such:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
?mpg
Categorical: manufacturer, model, trans, drv, fl, class
Continuous: displ, year, cyl, cty, hwy
You can see this information by running ?mpg and reading each variable's description.
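The variable types can also be checked directly in the console; for example, glimpse() (loaded with the tidyverse) prints each column's type:

```r
library(tidyverse)

# <chr> columns are categorical; <dbl> and <int> columns are continuous
glimpse(mpg)
```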
Assigning color to a continuous variable provides a numerical scale of colors while assigning color to a categorical variable categorizes all levels of the variable by different colors.
Similarly, assigning size to a continuous variable provides a numerical scale of sizes, while assigning size to a categorical variable (although not advised), categorizes all levels of the variable by different sizes.
Continuous variables cannot be mapped to shape. Assigning shape to a categorical variable represents each level of the variable with a different shape (with a maximum of 6 discrete values, because more than 6 becomes difficult to discriminate).
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = cty))
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = cty))
You get a scatterplot with an overlap of two aesthetics; both convey the same information.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = cty, alpha=cty))
It is used to modify the width of the border of shapes with borders.
?geom_point
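As a minimal sketch, stroke only applies to shapes that have a border (shapes 21-24), which distinguish fill (interior) from colour (border):

```r
library(tidyverse)

# shape 21 is a circle with a separate fill and border;
# stroke widens the border
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy),
             shape = 21, fill = "white", stroke = 2)
```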
It colors the points by whether displ < 5 is TRUE or FALSE, splitting the x-axis variable (in this case, displ) at 5 into two different colors.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = displ<5))
We get a separate facet (panel) for each unique value of the continuous variable.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ cty, nrow = 2)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ cyl)
The empty cells in
facet_grid suggest that the variables
did not intersect within these values. This relates to the first plot
because we can notice that in the first plot, there is much empty space
and if we were to facet it, we are likely to have empty cells as
well.
The . acts as a placeholder in the facet_grid() formula: it allows us to facet
the plot by all levels of the other variable, without faceting in the . dimension.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ .)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(. ~ cyl)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
Advantages: it is easier to identify patterns within each value of a
variable; we can have a larger number of facets while still being able
to identify which facet represents which value, while with a large
number of colors, it may be difficult to discriminate between values.
Disadvantages: It may be more challenging to compare patterns between
values of the variable and to identify overall patterns.
?facet_wrap
nrow determines the number of rows in the facet layout. Other
options that control the layout of the individual panels are:
ncol, scales, and dir. ncol determines the
number of columns in the facet layout. facet_grid() does not have
nrow and ncol arguments because the numbers of rows and columns
are determined by the number of unique levels of the faceting variables.
Because if you put the variable with more unique levels in the rows, it would create a plot that is taller than it is wide, which may be hard to follow and interpret on most (landscape-oriented) computer screens.
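For instance, the class facets above can be forced into a fixed number of columns, filled vertically:

```r
library(tidyverse)

# ncol sets the number of columns; dir = "v" fills panels down each column
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, ncol = 4, dir = "v")
```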
line chart - geom_line()
boxplot - geom_boxplot()
histogram - geom_histogram()
This will create a scatterplot with displ as the x variable and hwy as the y variable. drv will be represented by coloring both the points and the smoothing lines by its levels, and the standard error ribbon will not appear (se = FALSE).
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
show.legend = FALSE hides the legend for that layer of the plot. It is plausible that it was removed to ensure the plot's size is consistent with previous plots.
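A minimal sketch of the argument's effect, suppressing the drv legend for a single layer:

```r
library(tidyverse)

# show.legend = FALSE removes this layer's entry from the legend
ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, color = drv),
              show.legend = FALSE)
```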
It visualizes the confidence interval around the smoothing line as a shaded area; TRUE to show, FALSE to hide.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()
ggplot() +
geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
No, they will be identical, because both geom_point() and geom_smooth() inherit the same data and mapping from ggplot(), which is exactly what the second version specifies in each layer.
ggplot(data=mpg, mapping=aes(x = displ, y = hwy)) +
geom_point(size=5) +
geom_smooth(se = FALSE, size=2)
ggplot(data = mpg, mapping = aes(x = displ, y = hwy))+
geom_smooth(mapping = aes(group = drv), se = FALSE, size=2) +
geom_point(size=5)
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color=drv))+
geom_smooth(se = FALSE, size=2) +
geom_point(size=5)
ggplot(data = mpg, mapping = aes (x = displ, y = hwy))+
geom_smooth(se = FALSE, size=2) +
geom_point(mapping=aes(color=drv), size=5)
ggplot(data = mpg, mapping = aes(x = displ, y = hwy))+
geom_smooth(mapping=aes(linetype=drv), se = FALSE, size=2) +
geom_point(mapping = aes(color=drv), size=5)
ggplot(data = mpg, mapping = aes(x = displ, y = hwy))+
geom_point(size = 5, color = "white") +
geom_point(aes(color = drv))
?stat_summary
The default geom associated with stat_summary() is geom_pointrange().
ggplot(data = diamonds) +
stat_summary(
mapping = aes(x = cut, y = depth),
fun.min = min,
fun.max = max,
fun = median
)
?geom_col
?geom_bar
There are two types of bar charts: geom_bar() and geom_col(). geom_bar() makes the height of the bar proportional to the number of cases in each group (or if the weight aesthetic is supplied, the sum of the weights). If you want the heights of the bars to represent values in the data, use geom_col() instead. geom_bar() uses stat_count() by default: it counts the number of cases at each x position. geom_col() uses stat_identity(): it leaves the data as is.
Geoms and stats typically come in pairs that share a common name, and each geom has a matching default stat (and vice versa).
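For example, geom_bar() pairs with stat_count(), so these two calls produce the same plot:

```r
library(tidyverse)

# geom_bar() defaults to stat = "count";
# stat_count() defaults to geom = "bar"
ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut))
ggplot(data = diamonds) + stat_count(mapping = aes(x = cut))
```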
?stat_smooth
Two parameters that control stat_smooth()'s behavior are
method and formula. Method specifies the kind
of smoothing function to apply, like a linear model (lm) or a
generalized linear model (glm), and formula allows the user to provide
the variables of interest for the given modeling strategy. In addition,
users can also adjust the fullrange parameter, which tells
R to either fit the smoothing for just the data or the full plot.
stat_smooth() generates the following variables:
- y or x: predicted value of the given variable(s)
- ymin or xmin: lower pointwise confidence interval around the mean
- ymax or xmax: upper pointwise confidence interval around the mean
- se: standard error
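A brief sketch of these parameters in use, fitting a straight line instead of the default loess curve:

```r
library(tidyverse)

# method chooses the smoothing model; formula its functional form
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ x, se = TRUE)
```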
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = after_stat(prop)))
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = color, y = after_stat(prop)))
The problem is that all of the bars are the same height, which is uninformative: without a group, the proportions are computed within each x value, so every bar equals 1. Setting group = 1 ensures that the proportions are calculated relative to the entire dataset.
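One way to fix both plots (a sketch of a commonly used approach): add group = 1 to the first, and normalize by the overall count in the second:

```r
library(tidyverse)

# group = 1 computes prop over the whole dataset rather than per group
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))

# for the filled version, divide each count by the grand total
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = color,
                         y = after_stat(count) / sum(after_stat(count))))
```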
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point()
It is possible that many points in the plot overlap and, therefore, hide
each other. We can improve the plot by using the geom_jitter() function
to add a small amount of random noise to each point, which will reveal
any overlapping points. As such:
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_jitter()
Width, controlling the horizontal random noise, and height, controlling the vertical random noise.
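For example, the amount of noise can be set explicitly:

```r
library(tidyverse)

# width/height give the maximum jitter in each direction
# (the default is 40% of the data's resolution)
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter(width = 0.5, height = 0.5)
```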
?geom_jitter
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_count()
While
geom_jitter() adds random noise to each point to reduce
overplotting, geom_count() changes the size of
each point according to the number of observations at that value.
Although geom_count() can help reduce overplotting, it may
create overplotting of its own due to the size of the points.
?geom_boxplot
The default position adjustment for geom_boxplot() is dodge2.
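The effect of dodge2 becomes visible once a grouping aesthetic is added; a minimal sketch:

```r
library(tidyverse)

# with fill = drv, dodge2 places the boxes for each drive train
# side by side within every class
ggplot(data = mpg, aes(x = class, y = hwy, fill = drv)) +
  geom_boxplot()
```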
ggplot(data = mpg, aes(x = drv, y = hwy)) +
geom_boxplot()
ggplot(mpg, aes(x = class, fill = drv)) + geom_bar()
ggplot(mpg, aes(x = class, fill = drv)) + geom_bar() + coord_polar()
?labs()
labs() allows users to set plot labels. Within the
labs() function, users can provide details for host of plot
labels, like title, subtitle, caption, and more.
?coord_quickmap
coord_map() from ggplot2 allows users to map their data
onto a 2D projection of the earth. Due to the curvature of the earth,
the documentation notes that map projections do not preserve straight
lines. As such, coord_quickmap() can be used alternatively,
which performs calculations to provide an approximation of the requested
area of the globe but with straight lines.
The plot below suggests a strong positive relationship between city miles per gallon and highway miles per gallon; as one rises, so does the other.
coord_fixed() is important because it ensures a fixed ratio between the physical representation of data points on the X and Y axes. This makes it easier to identify patterns in the plot.
geom_abline() adds a reference line to a plot, which is useful for annotating plots (geom_hline() and geom_vline() add horizontal and vertical lines). In this case, it added the default 45-degree line (y = x), which made it easier to identify the relationship between city and highway mpg.
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point() +
geom_abline() +
coord_fixed()
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point() +
geom_abline()
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point() +
coord_fixed()
my_variable <- 10
my_varıable
#> Error in eval(expr, envir, enclos): object 'my_varıable' not found
This code does not work because the second line uses the dotless character "ı" instead of "i", so the name typed does not match the name of the variable defined above.
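Typing the name exactly as it was defined resolves the error:

```r
my_variable <- 10
my_variable
## [1] 10
```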
library(tidyverse)
ggplot(dota = mpg) + geom_point(mapping = aes(x = displ, y = hwy))
fliter(mpg, cyl = 8)
filter(diamond, carat > 3)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
filter(mpg, cyl == 8)
## # A tibble: 70 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a6 quattro 4.2 2008 8 auto… 4 16 23 p mids…
## 2 chevrolet c1500 sub… 5.3 2008 8 auto… r 14 20 r suv
## 3 chevrolet c1500 sub… 5.3 2008 8 auto… r 11 15 e suv
## 4 chevrolet c1500 sub… 5.3 2008 8 auto… r 14 20 r suv
## 5 chevrolet c1500 sub… 5.7 1999 8 auto… r 13 17 r suv
## 6 chevrolet c1500 sub… 6 2008 8 auto… r 12 17 r suv
## 7 chevrolet corvette 5.7 1999 8 manu… r 16 26 p 2sea…
## 8 chevrolet corvette 5.7 1999 8 auto… r 15 23 p 2sea…
## 9 chevrolet corvette 6.2 2008 8 manu… r 16 26 p 2sea…
## 10 chevrolet corvette 6.2 2008 8 auto… r 15 25 p 2sea…
## # … with 60 more rows
filter(diamonds, carat > 3)
## # A tibble: 32 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 3.01 Premium I I1 62.7 58 8040 9.1 8.97 5.67
## 2 3.11 Fair J I1 65.9 57 9823 9.15 9.02 5.98
## 3 3.01 Premium F I1 62.2 56 9925 9.24 9.13 5.73
## 4 3.05 Premium E I1 60.9 58 10453 9.26 9.25 5.66
## 5 3.02 Fair I I1 65.2 56 10577 9.11 9.02 5.91
## 6 3.01 Fair H I1 56.1 62 10761 9.54 9.38 5.31
## 7 3.65 Fair H I1 67.1 53 11668 9.53 9.48 6.38
## 8 3.24 Premium H I1 62.1 58 12300 9.44 9.4 5.85
## 9 3.22 Ideal I I1 62.6 55 12545 9.49 9.42 5.92
## 10 3.5 Ideal H I1 62.8 57 12587 9.65 9.59 6.03
## # … with 22 more rows
Tip from: Dr. Rebecca Hirst @HirstR omg did you know you can access #R #cheatsheets straight from Rstudio?!
I’m heading back under my rock now 🙈 #coding #rstats
To follow Dr. Hirst’s tip, in RStudio, go to: Help > Cheat Sheets
- Warn if variable used has no definition in scope: warns if a symbol is used with no definition in the current, or parent, scope.
- Warn if variable is defined but not used: helps to identify if a variable is created but never used.
- Provide R style diagnostics (e.g. whitespace): checks whether your code conforms to Hadley Wickham’s style guide, and reports style warnings when encountered.
library(nycflights13)
library(tidyverse)
- Had an arrival delay of two or more hours
- Flew to Houston (IAH or HOU)
- Were operated by United, American, or Delta
- Departed in summer (July, August, and September)
- Arrived more than two hours late, but didn’t leave late
- Were delayed by at least an hour, but made up over 30 minutes in flight
- Departed between midnight and 6am (inclusive)
filter(flights, arr_delay >= 120)
## # A tibble: 10,200 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 811 630 101 1047 830
## 2 2013 1 1 848 1835 853 1001 1950
## 3 2013 1 1 957 733 144 1056 853
## 4 2013 1 1 1114 900 134 1447 1222
## 5 2013 1 1 1505 1310 115 1638 1431
## 6 2013 1 1 1525 1340 105 1831 1626
## 7 2013 1 1 1549 1445 64 1912 1656
## 8 2013 1 1 1558 1359 119 1718 1515
## 9 2013 1 1 1732 1630 62 2028 1825
## 10 2013 1 1 1803 1620 103 2008 1750
## # … with 10,190 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
filter(flights, dest == "IAH" | dest == "HOU")
## # A tibble: 9,313 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 623 627 -4 933 932
## 4 2013 1 1 728 732 -4 1041 1038
## 5 2013 1 1 739 739 0 1104 1038
## 6 2013 1 1 908 908 0 1228 1219
## 7 2013 1 1 1028 1026 2 1350 1339
## 8 2013 1 1 1044 1045 -1 1352 1351
## 9 2013 1 1 1114 900 134 1447 1222
## 10 2013 1 1 1205 1200 5 1503 1505
## # … with 9,303 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
filter(flights, carrier %in% c("UA", "AA", "DL"))
## # A tibble: 139,504 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 554 600 -6 812 837
## 5 2013 1 1 554 558 -4 740 728
## 6 2013 1 1 558 600 -2 753 745
## 7 2013 1 1 558 600 -2 924 917
## 8 2013 1 1 558 600 -2 923 937
## 9 2013 1 1 559 600 -1 941 910
## 10 2013 1 1 559 600 -1 854 902
## # … with 139,494 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
filter(flights, month %in% c(7, 8, 9))
## # A tibble: 86,326 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 7 1 1 2029 212 236 2359
## 2 2013 7 1 2 2359 3 344 344
## 3 2013 7 1 29 2245 104 151 1
## 4 2013 7 1 43 2130 193 322 14
## 5 2013 7 1 44 2150 174 300 100
## 6 2013 7 1 46 2051 235 304 2358
## 7 2013 7 1 48 2001 287 308 2305
## 8 2013 7 1 58 2155 183 335 43
## 9 2013 7 1 100 2146 194 327 30
## 10 2013 7 1 100 2245 135 337 135
## # … with 86,316 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
filter(flights, arr_delay > 120, dep_delay <= 0)
## # A tibble: 29 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 27 1419 1420 -1 1754 1550
## 2 2013 10 7 1350 1350 0 1736 1526
## 3 2013 10 7 1357 1359 -2 1858 1654
## 4 2013 10 16 657 700 -3 1258 1056
## 5 2013 11 1 658 700 -2 1329 1015
## 6 2013 3 18 1844 1847 -3 39 2219
## 7 2013 4 17 1635 1640 -5 2049 1845
## 8 2013 4 18 558 600 -2 1149 850
## 9 2013 4 18 655 700 -5 1213 950
## 10 2013 5 22 1827 1830 -3 2217 2010
## # … with 19 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
filter(flights, dep_delay >= 60 , arr_delay < 30)
## # A tibble: 206 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 3 1850 1745 65 2148 2120
## 2 2013 1 3 1950 1845 65 2228 2227
## 3 2013 1 3 2015 1915 60 2135 2111
## 4 2013 1 6 1019 900 79 1558 1530
## 5 2013 1 7 1543 1430 73 1758 1735
## 6 2013 1 11 1020 920 60 1311 1245
## 7 2013 1 12 1706 1600 66 1949 1927
## 8 2013 1 12 1953 1845 68 2154 2137
## 9 2013 1 19 1456 1355 61 1636 1615
## 10 2013 1 21 1531 1430 61 1843 1815
## # … with 196 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
summary(flights$dep_time)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1 907 1401 1349 1744 2400 8255
# the max departure time is 2400, which represents midnight
filter(flights, dep_time <= 600 | dep_time == 2400)
## # A tibble: 9,373 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # … with 9,363 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
The between(x, left, right) function allows you to filter values where x is greater than or equal to left and less than or equal to right.
The code for exercise 1.4 can be simplified using the between() function as follows:
# Simplifying code using between() function
filter(flights, between(month, 7, 9))
## # A tibble: 86,326 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 7 1 1 2029 212 236 2359
## 2 2013 7 1 2 2359 3 344 344
## 3 2013 7 1 29 2245 104 151 1
## 4 2013 7 1 43 2130 193 322 14
## 5 2013 7 1 44 2150 174 300 100
## 6 2013 7 1 46 2051 235 304 2358
## 7 2013 7 1 48 2001 287 308 2305
## 8 2013 7 1 58 2155 183 335 43
## 9 2013 7 1 100 2146 194 327 30
## 10 2013 7 1 100 2245 135 337 135
## # … with 86,316 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
filter(flights, is.na(dep_time))
## # A tibble: 8,255 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 NA 1630 NA NA 1815
## 2 2013 1 1 NA 1935 NA NA 2240
## 3 2013 1 1 NA 1500 NA NA 1825
## 4 2013 1 1 NA 600 NA NA 901
## 5 2013 1 2 NA 1540 NA NA 1747
## 6 2013 1 2 NA 1620 NA NA 1746
## 7 2013 1 2 NA 1355 NA NA 1459
## 8 2013 1 2 NA 1420 NA NA 1644
## 9 2013 1 2 NA 1321 NA NA 1536
## 10 2013 1 2 NA 1545 NA NA 1910
## # … with 8,245 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
NA^0 is not missing since mathematically any value to the power of 0 is 1
NA | TRUE is TRUE: NA represents an unknown value, but "unknown OR TRUE" is TRUE no matter what the unknown value is, therefore it returns TRUE
NA & FALSE is FALSE: "unknown AND FALSE" is FALSE no matter what the unknown value is, therefore it returns FALSE
NA^0
## [1] 1
NA | TRUE
## [1] TRUE
NA & FALSE
## [1] FALSE
NA *0
## [1] NA
arrange(flights, desc(is.na(dep_delay)), dep_delay)
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 NA 1630 NA NA 1815
## 2 2013 1 1 NA 1935 NA NA 2240
## 3 2013 1 1 NA 1500 NA NA 1825
## 4 2013 1 1 NA 600 NA NA 901
## 5 2013 1 2 NA 1540 NA NA 1747
## 6 2013 1 2 NA 1620 NA NA 1746
## 7 2013 1 2 NA 1355 NA NA 1459
## 8 2013 1 2 NA 1420 NA NA 1644
## 9 2013 1 2 NA 1321 NA NA 1536
## 10 2013 1 2 NA 1545 NA NA 1910
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# arrange dep_delay column in descending order
arrange(flights, desc(dep_delay))
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 9 641 900 1301 1242 1530
## 2 2013 6 15 1432 1935 1137 1607 2120
## 3 2013 1 10 1121 1635 1126 1239 1810
## 4 2013 9 20 1139 1845 1014 1457 2210
## 5 2013 7 22 845 1600 1005 1044 1815
## 6 2013 4 10 1100 1900 960 1342 2211
## 7 2013 3 17 2321 810 911 135 1020
## 8 2013 6 27 959 1900 899 1236 2226
## 9 2013 7 22 2257 759 898 121 1026
## 10 2013 12 5 756 1700 896 1058 2020
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# arrange dep_delay column in ascending order
arrange(flights, dep_delay)
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 12 7 2040 2123 -43 40 2352
## 2 2013 2 3 2022 2055 -33 2240 2338
## 3 2013 11 10 1408 1440 -32 1549 1559
## 4 2013 1 11 1900 1930 -30 2233 2243
## 5 2013 1 29 1703 1730 -27 1947 1957
## 6 2013 8 9 729 755 -26 1002 955
## 7 2013 10 23 1907 1932 -25 2143 2143
## 8 2013 3 30 2030 2055 -25 2213 2250
## 9 2013 3 2 1431 1455 -24 1601 1631
## 10 2013 5 5 934 958 -24 1225 1309
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
arrange(flights, desc(distance/air_time))
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 5 25 1709 1700 9 1923 1937
## 2 2013 7 2 1558 1513 45 1745 1719
## 3 2013 5 13 2040 2025 15 2225 2226
## 4 2013 3 23 1914 1910 4 2045 2043
## 5 2013 1 12 1559 1600 -1 1849 1917
## 6 2013 11 17 650 655 -5 1059 1150
## 7 2013 2 21 2355 2358 -3 412 438
## 8 2013 11 17 759 800 -1 1212 1255
## 9 2013 11 16 2003 1925 38 17 36
## 10 2013 11 16 2349 2359 -10 402 440
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
arrange(flights, desc(distance))
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 857 900 -3 1516 1530
## 2 2013 1 2 909 900 9 1525 1530
## 3 2013 1 3 914 900 14 1504 1530
## 4 2013 1 4 900 900 0 1516 1530
## 5 2013 1 5 858 900 -2 1519 1530
## 6 2013 1 6 1019 900 79 1558 1530
## 7 2013 1 7 1042 900 102 1620 1530
## 8 2013 1 8 901 900 1 1504 1530
## 9 2013 1 9 641 900 1301 1242 1530
## 10 2013 1 10 859 900 -1 1449 1530
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
arrange(flights, distance)
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 7 27 NA 106 NA NA 245
## 2 2013 1 3 2127 2129 -2 2222 2224
## 3 2013 1 4 1240 1200 40 1333 1306
## 4 2013 1 4 1829 1615 134 1937 1721
## 5 2013 1 4 2128 2129 -1 2218 2224
## 6 2013 1 5 1155 1200 -5 1241 1306
## 7 2013 1 6 2125 2129 -4 2224 2224
## 8 2013 1 7 2124 2129 -5 2212 2224
## 9 2013 1 8 2127 2130 -3 2304 2225
## 10 2013 1 9 2126 2129 -3 2217 2224
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.
select (flights, dep_time, dep_delay, arr_time, arr_delay)
## # A tibble: 336,776 × 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # … with 336,766 more rows
select (flights, 4, 6, 7, 9)
## # A tibble: 336,776 × 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # … with 336,766 more rows
select (flights, starts_with("dep"),starts_with("arr"))
## # A tibble: 336,776 × 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # … with 336,766 more rows
select (flights, all_of(c("dep_time", "dep_delay", "arr_time", "arr_delay")))
## # A tibble: 336,776 × 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # … with 336,766 more rows
select (flights, any_of(c("dep_time", "dep_delay", "arr_time", "arr_delay")))
## # A tibble: 336,776 × 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # … with 336,766 more rows
What happens if you include the name of a variable multiple times in a select() call?
The variable is only displayed once; the duplicate is essentially ignored by the select() function.
select (flights, dep_time, dep_time, dep_time)
## # A tibble: 336,776 × 1
## dep_time
## <int>
## 1 517
## 2 533
## 3 542
## 4 544
## 5 554
## 6 554
## 7 555
## 8 557
## 9 557
## 10 558
## # … with 336,766 more rows
What does the any_of() function do? Why might it be helpful in conjunction with this vector?
The any_of() function selects any of the variables specified within it, ignoring names that are not present in the data
Using it in conjunction with a vector such as vars allows you to define the variables to select in one place, prior to using the any_of() function within select()
This helps simplify the code written within select()
Further, if you are looking to select the same variables from multiple tables, it allows for cleaner duplication of the code (without having to list the variables multiple times)
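A short sketch (the variable names here mirror the vector used in the exercise):

```r
library(nycflights13)
library(tidyverse)

# define the columns of interest once, then reuse the vector
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
select(flights, any_of(vars))
```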
Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?
It is surprising that the select helpers ignore case by default; passing ignore.case = FALSE to a helper such as contains() makes the matching case-sensitive
select(flights, contains("TIME"))
## # A tibble: 336,776 × 6
## dep_time sched_dep_time arr_time sched_arr_time air_time time_hour
## <int> <int> <int> <int> <dbl> <dttm>
## 1 517 515 830 819 227 2013-01-01 05:00:00
## 2 533 529 850 830 227 2013-01-01 05:00:00
## 3 542 540 923 850 160 2013-01-01 05:00:00
## 4 544 545 1004 1022 183 2013-01-01 05:00:00
## 5 554 600 812 837 116 2013-01-01 06:00:00
## 6 554 558 740 728 150 2013-01-01 05:00:00
## 7 555 600 913 854 158 2013-01-01 06:00:00
## 8 557 600 709 723 53 2013-01-01 06:00:00
## 9 557 600 838 846 140 2013-01-01 06:00:00
## 10 558 600 753 745 138 2013-01-01 06:00:00
## # … with 336,766 more rows
select(flights, contains("TIME", ignore.case = FALSE))
## # A tibble: 336,776 × 0
# dep_time
transmute(flights,
dep_time,
dep_time_hour = dep_time %/% 100,
dep_time_minute = dep_time %% 100,
)
## # A tibble: 336,776 × 3
## dep_time dep_time_hour dep_time_minute
## <int> <dbl> <dbl>
## 1 517 5 17
## 2 533 5 33
## 3 542 5 42
## 4 544 5 44
## 5 554 5 54
## 6 554 5 54
## 7 555 5 55
## 8 557 5 57
## 9 557 5 57
## 10 558 5 58
## # … with 336,766 more rows
# sched_dep_time
transmute(flights,
sched_dep_time,
sched_dep_time_hour = sched_dep_time %/% 100,
sched_dep_time_minute = sched_dep_time %% 100
)
## # A tibble: 336,776 × 3
## sched_dep_time sched_dep_time_hour sched_dep_time_minute
## <int> <dbl> <dbl>
## 1 515 5 15
## 2 529 5 29
## 3 540 5 40
## 4 545 5 45
## 5 600 6 0
## 6 558 5 58
## 7 600 6 0
## 8 600 6 0
## 9 600 6 0
## 10 600 6 0
## # … with 336,766 more rows
Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?
We would expect the computed value to be the same as the air_time value; however, the values of arr_time - dep_time are greater than air_time.
This is because arr_time and dep_time are clock times in HHMM format, not quantities in a format suitable for computing a difference in times.
The first step that needs to be completed is to convert arr_time and dep_time to minutes.
However, even after completing this step and calculating the difference between arr_time and dep_time, the values are still not equivalent to those in the air_time column. In fact, some values are negative.
This is likely due to changes in time zone not being accounted for in our calculation, and to overnight flights whose arrival falls on the next day. All arrival and departure times would first need to be converted to the same time zone (and date) prior to completing the calculation for the values in both columns to match.
transmute(flights,
air_time,
air_time_calculated = arr_time - dep_time,
)
## # A tibble: 336,776 × 2
## air_time air_time_calculated
## <dbl> <int>
## 1 227 313
## 2 227 317
## 3 160 381
## 4 183 460
## 5 116 258
## 6 150 186
## 7 158 358
## 8 53 152
## 9 140 281
## 10 138 195
## # … with 336,766 more rows
# convert arrival time and departure time to hours and minutes
# this conversion converts midnight (2400) to 1440
# %% 1440 will convert these values to 0
transmute(flights,
air_time,
arr_time_mins = (arr_time %/% 100 * 60 + arr_time %% 100) %% 1440,
dep_time_mins = (dep_time %/% 100 * 60 + dep_time %% 100) %% 1440,
arr_dep_diff = arr_time_mins - dep_time_mins
)
## # A tibble: 336,776 × 4
## air_time arr_time_mins dep_time_mins arr_dep_diff
## <dbl> <dbl> <dbl> <dbl>
## 1 227 510 317 193
## 2 227 530 333 197
## 3 160 563 342 221
## 4 183 604 344 260
## 5 116 492 354 138
## 6 150 460 354 106
## 7 158 553 355 198
## 8 53 429 357 72
## 9 140 518 357 161
## 10 138 473 358 115
## # … with 336,766 more rows
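The conversion from HHMM clock times to minutes since midnight can be wrapped in a small helper; a base-R sketch (the name time2mins is ours, not from the book):

```r
# convert an HHMM clock time to minutes since midnight;
# %% 1440 maps midnight recorded as 2400 back to 0
time2mins <- function(x) {
  (x %/% 100 * 60 + x %% 100) %% 1440
}

time2mins(c(517, 830, 2400))  # 317 510 0
```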
Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?
We would expect dep_delay = dep_time - sched_dep_time.
However, running the calculation below, we observe that this is not the case.
As described above, this is because dep_time and sched_dep_time are clock times in HHMM format rather than minutes; the discrepancies of 40 appear whenever the delay crosses an hour boundary, since an hour is encoded as 100 rather than 60. Converting both columns to minutes since midnight would make the relationship hold.
transmute(flights,
dep_time,
sched_dep_time,
dep_delay,
dep_diff = dep_time - sched_dep_time,
difference = dep_delay -dep_diff
)
## # A tibble: 336,776 × 5
## dep_time sched_dep_time dep_delay dep_diff difference
## <int> <int> <dbl> <int> <dbl>
## 1 517 515 2 2 0
## 2 533 529 4 4 0
## 3 542 540 2 2 0
## 4 544 545 -1 -1 0
## 5 554 600 -6 -46 40
## 6 554 558 -4 -4 0
## 7 555 600 -5 -45 40
## 8 557 600 -3 -43 40
## 9 557 600 -3 -43 40
## 10 558 600 -2 -42 40
## # … with 336,766 more rows
Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank().
The min_rank() function assigns tied values the same (minimum) rank, which is appropriate here: flights with identical delays should share a rank.
Further, there are no ties among the 10 most delayed flights, so tie handling does not affect the result in this case.
# create a data frame with delay rank included
flights1 <- mutate(flights,
delay_rank = min_rank(dep_delay)
)
flights1
## # A tibble: 336,776 × 20
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # … with 336,766 more rows, and 12 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
## # delay_rank <int>
# arrange dataframe in descending order of delay rank
# use head function to extract first 10 rows that represent the top 10 most delayed flights
head(arrange(flights1, desc(delay_rank)), 10)
## # A tibble: 10 × 20
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 9 641 900 1301 1242 1530
## 2 2013 6 15 1432 1935 1137 1607 2120
## 3 2013 1 10 1121 1635 1126 1239 1810
## 4 2013 9 20 1139 1845 1014 1457 2210
## 5 2013 7 22 845 1600 1005 1044 1815
## 6 2013 4 10 1100 1900 960 1342 2211
## 7 2013 3 17 2321 810 911 135 1020
## 8 2013 6 27 959 1900 899 1236 2226
## 9 2013 7 22 2257 759 898 121 1026
## 10 2013 12 5 756 1700 896 1058 2020
## # … with 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>, delay_rank <int>
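The tie behavior of min_rank() matches base R's rank() with ties.method = "min": tied values all receive the lowest rank of their group, and the following rank is skipped. A quick illustration on made-up delays:

```r
delays <- c(10, 45, 45, 90)

# both 45s receive rank 2; rank 3 is skipped
rank(delays, ties.method = "min")  # 1 2 2 4
```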
1:3 + 1:10
## [1] 2 4 6 5 7 9 8 10 12 11
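The result above follows R's recycling rule: the shorter vector 1:3 is repeated until it matches the length of 1:10, and because 10 is not a multiple of 3, R also warns that the longer object length is not a multiple of the shorter object length. With lengths that divide evenly there is no warning:

```r
# 1:3 is recycled twice against 1:6: (1+1, 2+2, 3+3, 1+4, 2+5, 3+6)
1:3 + 1:6  # 2 4 6 5 7 9
```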
What trigonometric functions does R provide?
All trigonometric functions provided by R can be found under the help documentation for Trig
R provides cos(x), sin(x), and tan(x); their inverses acos(x), asin(x), and atan(x); and cospi(x), sinpi(x), and tanpi(x), which compute cos(pi * x), sin(pi * x), and tan(pi * x).
And finally, atan2(y, x), which provides the angle between the x-axis and the vector from (0,0) to (x, y).
?Trig
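As a quick check of atan2(), the vector from (0, 0) to (1, 1) makes an angle of pi / 4 (45 degrees) with the x-axis, and unlike atan(y / x), atan2() keeps track of the quadrant:

```r
atan2(1, 1)    # pi / 4: first quadrant
atan2(-1, -1)  # -3 * pi / 4: third quadrant

# atan() alone cannot distinguish the two, since the signs cancel
atan(-1 / -1) == atan(1 / 1)  # TRUE
```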
Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:
A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
A flight is always 10 minutes late.
A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
99% of the time a flight is on time. 1% of the time it’s 2 hours late.
Which is more important: arrival delay or departure delay?
Ultimately, what matters most is the organizational problem or decision that the data is meant to inform.
Assessing the delay characteristics in the scenarios above allows the root causes of delays to be investigated in multiple ways.
If the analysis is used to understand how to improve processes at the departure airport to minimize delays, then departure delay may be more important.
Alternatively, if the data is used to understand how processes at the arrival airport could be improved, arrival delay would be more important.
Further, if the data is used to understand the overall costs associated with delays, then both variables may be equally important to understand in depth.
Thus, the research question or organizational problem at hand determines how best to group the data and which variables to focus on.
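The scenarios above suggest different summary statistics for "typical" delay: the mean and median for central tendency, the standard deviation or quantiles for spread, and the proportion of flights beyond a threshold for tail behavior. A base-R sketch on a made-up delay vector:

```r
# hypothetical arrival delays in minutes (negative = early)
delays <- c(-30, -15, 0, 0, 10, 10, 15, 30, 120)

mean(delays)                   # typical delay, pulled up by the 120-minute outlier
median(delays)                 # typical delay, robust to the outlier
sd(delays)                     # spread of delays
quantile(delays, c(0.5, 0.9))  # tail behavior
mean(delays > 60)              # share of flights more than an hour late
```

A flight that is always 10 minutes late has the same mean as one that swings between early and very late; only the spread and tail statistics tell them apart.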
not_cancelled <- flights %>%
filter(!is.na(dep_delay), !is.na(arr_delay))
not_cancelled %>%
count(dest)
## # A tibble: 104 × 2
## dest n
## <chr> <int>
## 1 ABQ 254
## 2 ACK 264
## 3 ALB 418
## 4 ANC 8
## 5 ATL 16837
## 6 AUS 2411
## 7 AVL 261
## 8 BDL 412
## 9 BGR 358
## 10 BHM 269
## # … with 94 more rows
# a total count of flights that were not cancelled by destination is provided
# alternative method to generate this using group_by
not_cancelled %>%
group_by(dest)%>%
summarise (
total_not_cancelled = n()
)
## # A tibble: 104 × 2
## dest total_not_cancelled
## <chr> <int>
## 1 ABQ 254
## 2 ACK 264
## 3 ALB 418
## 4 ANC 8
## 5 ATL 16837
## 6 AUS 2411
## 7 AVL 261
## 8 BDL 412
## 9 BGR 358
## 10 BHM 269
## # … with 94 more rows
not_cancelled %>% count(tailnum, wt = distance)
## # A tibble: 4,037 × 2
## tailnum n
## <chr> <dbl>
## 1 D942DN 3418
## 2 N0EGMQ 239143
## 3 N10156 109664
## 4 N102UW 25722
## 5 N103US 24619
## 6 N104UW 24616
## 7 N10575 139903
## 8 N105UW 23618
## 9 N107US 21677
## 10 N108UW 32070
## # … with 4,027 more rows
# alternative without using count
not_cancelled %>%
group_by(tailnum)%>%
summarise (
total = sum(distance)
)
## # A tibble: 4,037 × 2
## tailnum total
## <chr> <dbl>
## 1 D942DN 3418
## 2 N0EGMQ 239143
## 3 N10156 109664
## 4 N102UW 25722
## 5 N103US 24619
## 6 N104UW 24616
## 7 N10575 139903
## 8 N105UW 23618
## 9 N107US 21677
## 10 N108UW 32070
## # … with 4,027 more rows
Our definition of cancelled flights (is.na(dep_delay) | is.na(arr_delay) ) is slightly suboptimal. Why? Which is the most important column?
Every flight flagged as cancelled by this definition has a missing arr_delay, but not all of them have a missing dep_delay.
That is, there are rows in which a value is logged for dep_delay yet is missing for arr_delay.
These may be flights that departed (perhaps late) but were then diverted or otherwise never arrived at their scheduled destination.
Thus the most important column is arr_delay: is.na(arr_delay) alone identifies the same set of flights.
flights %>%
filter(is.na(dep_delay) | is.na(arr_delay)) %>%
select (dep_delay, arr_delay)%>%
filter (!is.na(arr_delay))
## # A tibble: 0 × 2
## # … with 2 variables: dep_delay <dbl>, arr_delay <dbl>
# all arr_delay rows = NA however all dep_delay rows are not NA
Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights %>% group_by(carrier, dest) %>% summarise(n()))
F9, EV, and YV had the highest average departure delays; US and HA had the lowest in 2013. Disentangling bad airports from bad carriers is difficult because carriers fly different mixes of routes: comparing carriers within each destination (flights %>% group_by(carrier, dest) %>% summarise(n())) helps, but many carrier-destination pairs have few or no flights, so the two effects remain partly confounded.
flights %>%
filter(!is.na(dep_delay)) %>%
group_by(carrier) %>%
summarise(carrier_delay = mean(dep_delay)) %>%
arrange(desc(carrier_delay))
## # A tibble: 16 × 2
## carrier carrier_delay
## <chr> <dbl>
## 1 F9 20.2
## 2 EV 20.0
## 3 YV 19.0
## 4 FL 18.7
## 5 WN 17.7
## 6 9E 16.7
## 7 B6 13.0
## 8 VX 12.9
## 9 OO 12.6
## 10 UA 12.1
## 11 MQ 10.6
## 12 DL 9.26
## 13 AA 8.59
## 14 AS 5.80
## 15 HA 4.90
## 16 US 3.78
flights %>%
group_by(carrier) %>%
filter(!is.na(dep_delay)) %>%
summarize(avg_delay = mean(dep_delay))
## # A tibble: 16 × 2
## carrier avg_delay
## <chr> <dbl>
## 1 9E 16.7
## 2 AA 8.59
## 3 AS 5.80
## 4 B6 13.0
## 5 DL 9.26
## 6 EV 20.0
## 7 F9 20.2
## 8 FL 18.7
## 9 HA 4.90
## 10 MQ 10.6
## 11 OO 12.6
## 12 UA 12.1
## 13 US 3.78
## 14 VX 12.9
## 15 WN 17.7
## 16 YV 19.0
What does the sort argument to count() do? When might you use it?
The sort argument of count() shows the largest groups at the top of the returned data frame. It is useful when you want counts printed in descending order, similar to arrange(desc(x)). The code below uses the sort argument in count() to show which airlines logged the most flights in the data set.
flights %>%
count(carrier, sort = TRUE)
## # A tibble: 16 × 2
## carrier n
## <chr> <int>
## 1 UA 58665
## 2 B6 54635
## 3 EV 54173
## 4 DL 48110
## 5 AA 32729
## 6 MQ 26397
## 7 US 20536
## 8 9E 18460
## 9 WN 12275
## 10 VX 5162
## 11 FL 3260
## 12 AS 714
## 13 F9 685
## 14 YV 601
## 15 HA 342
## 16 OO 32
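A similar largest-first count can be produced in base R by combining table() with sort(); a toy illustration:

```r
x <- c("UA", "B6", "UA", "DL", "UA", "B6")

# frequency table sorted largest first, analogous to count(x, sort = TRUE)
sort(table(x), decreasing = TRUE)
```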
Using group_by() prior to filtering or mutating means that the filters or calculations are applied within each level of the grouping variable. In the example below, after grouping by month, mutate() calculates the average delay separately for each month and returns it as a new column on the data set. The within-month calculation can be seen in the value repeating across rows of the same month, which is expected since the mean is taken within each month.
flights %>%
group_by(month) %>%
filter(!is.na(dep_delay)) %>%
mutate(dep_delay_mean = mean(dep_delay)) %>%
select(month, dep_delay_mean) %>% head(10)
## # A tibble: 10 × 2
## # Groups: month [1]
## month dep_delay_mean
## <int> <dbl>
## 1 1 10.0
## 2 1 10.0
## 3 1 10.0
## 4 1 10.0
## 5 1 10.0
## 6 1 10.0
## 7 1 10.0
## 8 1 10.0
## 9 1 10.0
## 10 1 10.0
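The grouped-mutate pattern, a per-group summary broadcast back to every row, has a base-R counterpart in ave(), which may make the repeating values easier to see; a toy example:

```r
x <- c(1, 3, 10, 20)
g <- c("a", "a", "b", "b")

# group mean repeated on each row of its group,
# like group_by(g) %>% mutate(m = mean(x))
ave(x, g)  # 2 2 15 15
```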
flights %>%
filter(!is.na(tailnum), !is.na(arr_time)) %>%
group_by(tailnum) %>%
summarise(mean_delay = mean(arr_delay, na.rm = TRUE)) %>%
arrange (desc(mean_delay))
## # A tibble: 4,037 × 2
## tailnum mean_delay
## <chr> <dbl>
## 1 N844MH 320
## 2 N911DA 294
## 3 N922EV 276
## 4 N587NW 264
## 5 N851NW 219
## 6 N928DN 201
## 7 N7715E 188
## 8 N654UA 185
## 9 N665MQ 175.
## 10 N427SW 157
## # … with 4,027 more rows
flights %>%
group_by(hour) %>%
summarise(mean_delay = mean(arr_delay, na.rm = TRUE)) %>%
arrange((mean_delay))
## # A tibble: 20 × 2
## hour mean_delay
## <dbl> <dbl>
## 1 7 -5.30
## 2 5 -4.80
## 3 6 -3.38
## 4 9 -1.45
## 5 8 -1.11
## 6 10 0.954
## 7 11 1.48
## 8 12 3.49
## 9 13 6.54
## 10 14 9.20
## 11 23 11.8
## 12 15 12.3
## 13 16 12.6
## 14 18 14.8
## 15 22 16.0
## 16 17 16.0
## 17 19 16.7
## 18 20 16.7
## 19 21 18.4
## 20 1 NaN
flights %>%
filter(arr_delay >0) %>%
group_by(dest) %>%
summarise(total_delay = sum(arr_delay, na.rm = TRUE)) %>%
arrange((total_delay))
## # A tibble: 103 × 2
## dest total_delay
## <chr> <dbl>
## 1 PSP 42
## 2 ANC 67
## 3 HDN 127
## 4 SBN 129
## 5 MTJ 175
## 6 EYW 204
## 7 BZN 507
## 8 JAC 636
## 9 CHO 966
## 10 MYR 1037
## # … with 93 more rows
flights %>%
filter(arr_delay >0) %>%
group_by(dest) %>%
mutate(
total_delay = sum(arr_delay),
prop_delay = arr_delay / total_delay)%>%
arrange((total_delay))
## # A tibble: 133,004 × 21
## # Groups: dest [103]
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 26 1051 1055 -4 1401 1400
## 2 2013 12 21 1040 1030 10 1336 1335
## 3 2013 2 2 1051 1055 -4 1404 1400
## 4 2013 2 23 1053 1055 -2 1408 1400
## 5 2013 3 16 1046 1055 -9 1412 1355
## 6 2013 3 23 1050 1055 -5 1400 1355
## 7 2013 7 6 1629 1615 14 1954 1953
## 8 2013 7 13 1618 1615 3 1955 1953
## 9 2013 7 20 1618 1615 3 2003 1953
## 10 2013 8 3 1615 1615 0 2003 1953
## # … with 132,994 more rows, and 13 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
## # total_delay <dbl>, prop_delay <dbl>
delay_lags <- flights %>%
group_by(origin) %>%
mutate(dep_delay_lag = lag(dep_delay)) %>%
filter(!is.na(dep_delay), !is.na(dep_delay_lag))
ggplot(data = delay_lags, mapping = aes(x = dep_delay_lag, y = dep_delay)) +
geom_point () +
scale_x_continuous(breaks = seq(0, 1500, by = 100)) +
labs(y = "Flight Departure Delay", x = "Previous Flight Departure Delay")
Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error)
Flight DL 1499 had a speed of 703 miles per hour. That is substantially faster than most of the other recorded speeds. It is possible that distance for this flight was entered as kilometers and not miles.
# fast flights
flights %>%
mutate(hour = air_time/60, speed = distance/hour) %>%
arrange(desc(speed)) %>% select(flight, speed) %>% head(10)
## # A tibble: 10 × 2
## flight speed
## <int> <dbl>
## 1 1499 703.
## 2 4667 650.
## 3 4292 648
## 4 3805 641.
## 5 1902 591.
## 6 315 564
## 7 707 557.
## 8 936 556.
## 9 347 554.
## 10 1503 554.
fast_flights <- flights %>%
filter(!is.na(air_time))%>%
group_by(dest) %>%
mutate(
shortest_air_time = min(air_time),
relative_air_time = air_time / shortest_air_time
)%>%
arrange (desc(relative_air_time))
fast_flights
## # A tibble: 327,346 × 21
## # Groups: dest [104]
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 6 24 1932 1920 12 2228 2047
## 2 2013 6 17 1652 1700 -8 1856 1815
## 3 2013 7 23 1605 1400 125 1831 1511
## 4 2013 7 23 1617 1605 12 1833 1740
## 5 2013 7 23 1242 1237 5 1437 1351
## 6 2013 7 23 1200 1200 0 1428 1317
## 7 2013 2 17 841 840 1 1044 1003
## 8 2013 7 23 1242 1245 -3 1429 1355
## 9 2013 6 10 1356 1300 56 1646 1414
## 10 2013 6 29 755 800 -5 1035 909
## # … with 327,336 more rows, and 13 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
## # shortest_air_time <dbl>, relative_air_time <dbl>
flights %>%
group_by(dest, carrier) %>%
select(dest, carrier) %>%
ungroup() %>%
distinct(dest, carrier) %>% # unique destination and carrier combo
count(dest) %>% arrange(n) # show destinations with only one carrier
## # A tibble: 105 × 2
## dest n
## <chr> <int>
## 1 ABQ 1
## 2 ACK 1
## 3 ALB 1
## 4 ANC 1
## 5 BHM 1
## 6 BUR 1
## 7 BZN 1
## 8 CAK 1
## 9 CHO 1
## 10 CRW 1
## # … with 95 more rows
# same tidy chunk as above but now filtering out destination not = 1 carrier
multiple_carriers <- flights %>%
group_by(dest, carrier) %>%
select(dest, carrier) %>%
ungroup() %>%
distinct(dest, carrier) %>%
count(dest) %>% filter(!n == 1)
multiple_carriers %>%
arrange(desc(n)) # shows destinations with greatest number of carriers
## # A tibble: 76 × 2
## dest n
## <chr> <int>
## 1 ATL 7
## 2 BOS 7
## 3 CLT 7
## 4 ORD 7
## 5 TPA 7
## 6 AUS 6
## 7 DCA 6
## 8 DTW 6
## 9 IAD 6
## 10 MSP 6
## # … with 66 more rows
flights %>%
select(tailnum, year, month,day, dep_delay) %>%
filter(!is.na(dep_delay)) %>%
arrange(tailnum, year, month, day) %>%
group_by(tailnum) %>%
mutate(cumulative_min = cumsum(dep_delay)) %>% # using a cumulative function to sequentially add the dep_delay minutes
filter(cumulative_min < 60) %>%
count(tailnum) %>% arrange(desc(n)) %>% head(15) # finally, show each tail number with the number of flights 'n' before its cumulative delay reached 60 minutes; head() limits output for the report
## # A tibble: 15 × 2
## # Groups: tailnum [15]
## tailnum n
## <chr> <int>
## 1 N957UW 286
## 2 N952UW 282
## 3 N954UW 237
## 4 N961UW 226
## 5 N956UW 217
## 6 N747UW 171
## 7 N951UW 171
## 8 N947UW 162
## 9 N948UW 142
## 10 N958UW 130
## 11 N730MQ 120
## 12 N766US 117
## 13 N953UW 117
## 14 N705TW 111
## 15 N945UW 105
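The key step above is cumsum(), which produces a running total within each group once the rows are ordered; a quick base-R illustration on made-up delays:

```r
dep_delay <- c(5, 20, 40, 10)

# running total of delay minutes in flight order
cumsum(dep_delay)  # 5 25 65 75

# number of flights before the running total reaches 60 minutes
sum(cumsum(dep_delay) < 60)  # 2
```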
This chapter demonstrates the application of data.table in R as an alternative to tidyverse and base R methods of data wrangling. The authors suggest that data.table is particularly useful for working with large volumes of data (e.g., 10-100 GB).
library(data.table)
Due to slow loading speeds when importing the full pisa data set, even with fread() from data.table, we decided to use the smaller 'random6' data set that the authors also made available. According to the authors, the random data set contains records from a random sample of students in a single country, covering all 6 of its regions.
# read in the random data set
random6 <- fread("random6.csv", na.strings = "")
# check object size - 0.2 gigabytes
print(object.size(random6), unit = "GB")
## 0.2 Gb
# read in pisa data set
#pisa <- fread("pisa2015.csv")
Using data.table syntax to subset the data set, we found that there are 3,197 female students from Germany in the data set. An alternative method using the .N symbol to return the number of rows is shown below as well.
# explore data; first 5 and last 5 rows returned when printing a data.table object
#random6 # don't run for now due to loading speed
# subset all female students from the field ST004D01T
#random6[CNT == "Germany" & ST004D01T == "Female"]
# return row count with .N function and chaining
#random6[CNT == "Germany" & ST004D01T == "Female", .N] # don't print code as the output is very large in the markdown report
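data.table's general form is DT[i, j, by]: filter rows with i, compute in j, and group with by. A minimal self-contained sketch on toy data (not the pisa variables):

```r
library(data.table)

dt <- data.table(country = c("A", "A", "B", "B"),
                 score   = c(500, 520, 480, 510))

# i filters rows, j computes, by groups; .N counts rows per group
dt[score > 490, .(n = .N, avg = mean(score)), by = country]
```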
# first create custom function to convert bin to numeric
bin.to.num <- function(x){
if (is.na(x)) NA_integer_
else if (x == "Yes") 1L
else if (x == "No") 0L
else NA_integer_ # fall back to NA for any unexpected value
}
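The element-by-element sapply() recode works, but because R's comparison operators propagate NA, the same recode can be written in one vectorized call; a sketch (bin.to.num.vec is our own name, not from the book):

```r
# vectorized recode: "Yes" -> 1, "No" -> 0, NA stays NA
bin.to.num.vec <- function(x) as.integer(x == "Yes")

bin.to.num.vec(c("Yes", "No", NA))  # 1 0 NA
```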
# using the custom function, create indicator variables (computer, software, etc.) that we can use to calculate the proportion of students with access to these resources
random6[, `:=`
(female = ifelse(ST004D01T == "Female", 1, 0),
sex = ST004D01T,
# At my house we have ...
desk = sapply(ST011Q01TA, bin.to.num),
own.room = sapply(ST011Q02TA, bin.to.num),
quiet.study = sapply(ST011Q03TA, bin.to.num),
computer = sapply(ST011Q04TA, bin.to.num),
software = sapply(ST011Q05TA, bin.to.num),
internet = sapply(ST011Q06TA, bin.to.num),
lit = sapply(ST011Q07TA, bin.to.num),
poetry = sapply(ST011Q08TA, bin.to.num),
art = sapply(ST011Q09TA, bin.to.num),
book.sch = sapply(ST011Q10TA, bin.to.num),
tech.book = sapply(ST011Q11TA, bin.to.num),
dict = sapply(ST011Q12TA, bin.to.num),
art.book = sapply(ST011Q16NA, bin.to.num))]
In Germany, 96% of students have a computer in their home and 45% have educational software.
random6[CNTRYID == "Germany", table(computer)] # 247 No's and 5438 Yes'
## computer
## 0 1
## 247 5438
# computer prop
5438 / (5438 + 247) * 100 # ~96% of students from Germany in the data set have a computer in their home
## [1] 95.65523
random6[CNTRYID == "Germany", table(software)] # 3038 No's and 2481 Yes'
## software
## 0 1
## 3038 2481
# educational software prop
2481/(2481 + 3038) # 45% have access to educational software
## [1] 0.449538
In Uruguay, 89% of students have a computer in the home and 43% have educational software.
random6[CNTRYID == "Uruguay", table(computer)] # 635 No's and 5099 Yes'
## computer
## 0 1
## 635 5099
# computer in home prop
5099/(5099 + 635) # 89% have a computer in their home
## [1] 0.8892571
random6[CNTRYID == "Uruguay", table(software)] # 3013 No's and 2313 Yes'
## software
## 0 1
## 3013 2313
# educational software prop
2313/(2313 + 3013) # 43% have access to educational software
## [1] 0.4342846
Among female students in the random6 data set, 72% stated that they have their own room and 85% stated that they have a quiet place to study.
random6[female == 1, table(own.room)] # 4805 No's and 12644 Yes'
## own.room
## 0 1
## 4805 12644
# prop own.room
12644 / (12644 + 4805) #72%
## [1] 0.7246261
random6[female == 1, table(quiet.study)] #2606 No's and 14801 Yes'
## quiet.study
## 0 1
## 2606 14801
# prop quiet.study
14801 / (14801 + 2606) # 85%
## [1] 0.8502901
About half of students (50.6% of female and 49.8% of male students) have art in their home. The average age of both male and female students is 15.8 years.
# proportion of students who have art in their homes:
random6[, .(art = mean(art, na.rm = TRUE), AGE = mean(AGE, na.rm = TRUE)), by = .(sex)]
## sex art AGE
## 1: Female 0.5061029 15.81118
## 2: Male 0.4981741 15.80671
# create custom function to split a column of data by above or below median
median_cut = function(x) {
ifelse(x > median(x), "above_median", "below_median")
}
random6[,
.(own.room = mean(own.room, na.rm = TRUE),
desk_mean = mean(desk, na.rm = TRUE)),
by = .(median_cut(AGE)) ]
## median_cut own.room desk_mean
## 1: above_median 0.7439972 0.8684374
## 2: below_median 0.7441766 0.8680203
The authors show that the melt() function can be used to reshape data sets from wide to long format; we apply it to our correlation table further below. First, we recode PA032Q03TA and store it as sci_car.
# recode and store as 'sci_car'; use table to confirm re-coding
random6[,
"sci_car" := sapply(PA032Q03TA,
function(x) {
if (is.na(x)) NA
else if (x == "No") -1L
else if (x == "Yes") 1L
}) ][,
table(sci_car)]
## sci_car
## -1 1
## 5104 4946
Because we are using the more limited random6 data set, we restrict the comparison to Germany and Mexico, since the other countries examined by the authors are not in the data set.
random6[CNTRYID %in% c("Germany", "Mexico"),
.(mean(sci_car, na.rm = TRUE)),
by = .(sex, CNTRYID)]
## sex CNTRYID V1
## 1: Female Germany -0.8348018
## 2: Male Germany -0.7085292
## 3: Male Mexico 0.3975610
## 4: Female Mexico 0.3200577
# means and sd of variables hypothesized to predict sci_car
random6[,
.(environ_avg = mean(ENVAWARE, na.rm = TRUE),
environ_sd = sd(ENVAWARE, na.rm = TRUE),
joy_avg = mean(JOYSCIE, na.rm = TRUE),
joy_sd = sd(JOYSCIE, na.rm = TRUE),
self_eff_avg = mean(SCIEEFF, na.rm = TRUE),
self_eff_sd = sd(SCIEEFF, na.rm = TRUE)
)]
## environ_avg environ_sd joy_avg joy_sd self_eff_avg self_eff_sd
## 1: -0.1688647 1.097327 0.06985556 1.117806 -0.008148365 1.204854
# means and sd of focal variables, grouped by sci_car and sex
random6[,
.(environ_avg = mean(ENVAWARE, na.rm = TRUE),
environ_sd = sd(ENVAWARE, na.rm = TRUE),
joy_avg = mean(JOYSCIE, na.rm = TRUE),
joy_sd = sd(JOYSCIE, na.rm = TRUE),
self_eff_avg = mean(SCIEEFF, na.rm = TRUE),
self_eff_sd = sd(SCIEEFF, na.rm = TRUE)
), by = .(sci_car, sex)]
## sci_car sex environ_avg environ_sd joy_avg joy_sd self_eff_avg
## 1: NA Female -0.31842610 0.9874307 -0.09216258 1.1173348 -0.171922907
## 2: -1 Female -0.01864386 0.9728241 -0.11241710 1.0385336 0.016697673
## 3: 1 Male 0.18221720 1.1622236 0.55847968 1.0094256 0.361605273
## 4: -1 Male 0.06017463 1.1645145 0.12967621 1.0792426 0.162395292
## 5: NA Male -0.20814244 1.1916179 0.06879025 1.1356331 -0.009655823
## 6: 1 Female 0.08936607 0.9659264 0.55664182 0.9397745 0.311395444
## self_eff_sd
## 1: 1.186161
## 2: 1.088183
## 3: 1.107026
## 4: 1.141781
## 5: 1.267313
## 6: 1.059443
# first store smaller data set containing observations for all focal variables and sci_car
sci_car_variables <- random6[,
.(sci_car, ENVAWARE, JOYSCIE, SCIEEFF)]
sci_car_variables[,
.(env_joy_cor = cor(x = ENVAWARE, y = JOYSCIE, use = "complete.obs"),
joy_self_eff_cor = cor(x = JOYSCIE, y = SCIEEFF, use = "complete.obs"),
env_sci_car_cor = cor(x = ENVAWARE, y = sci_car, use = "complete.obs"),
sci_car_joy_cor = cor(x = JOYSCIE, y = sci_car, use = "complete.obs"))]
## env_joy_cor joy_self_eff_cor env_sci_car_cor sci_car_joy_cor
## 1: 0.3650092 0.3482787 0.05626268 0.2666819
# recode PV1MATH into an above/below-490 split and store as 'math_split'; use table to confirm re-coding
random6[,
"math_split" := sapply(PV1MATH,
function(x) {
if (is.na(x)) NA
else if (x <= 490) -1L
else if (x > 490) 1L
})][,
table(math_split)]
## math_split
## -1 1
## 21578 14269
random6[,
"reading_split" := sapply(PV1READ,
function(x) {
if (is.na(x)) NA
else if (x <= 493) -1L
else if (x > 493) 1L}) ][,
table(reading_split)]
## reading_split
## -1 1
## 21191 14656
# calculate correlations with these new variables and those above
# store as table for subsequent formatting
math_reading_cor <- random6[,
.(env_joy_cor = cor(x = ENVAWARE, y = JOYSCIE, use = "complete.obs"),
joy_self_eff_cor = cor(x = JOYSCIE, y = SCIEEFF, use = "complete.obs"),
env_sci_car_cor = cor(x = ENVAWARE, y = sci_car, use = "complete.obs"),
sci_car_joy_cor = cor(x = JOYSCIE, y = sci_car, use = "complete.obs"),
math_sci_car_cor = cor(x = math_split, y = sci_car, use = "complete.obs"),
reading_sci_car_cor = cor(x = reading_split, y = sci_car, use = "complete.obs"),
math_env_cor = cor(x = ENVAWARE, y = math_split, use = "complete.obs"),
reading_joy_cor = cor(x = JOYSCIE, y = reading_split, use = "complete.obs"),
math_self_eff_cor = cor(x = SCIEEFF, y = math_split, use = "complete.obs"),
reading_self_eff_cor = cor(x = SCIEEFF, y = reading_split, use = "complete.obs"))]
# pivot long using melt(); easier to view correlations this way
melt(math_reading_cor)
## variable value
## 1: env_joy_cor 0.36500924
## 2: joy_self_eff_cor 0.34827868
## 3: env_sci_car_cor 0.05626268
## 4: sci_car_joy_cor 0.26668187
## 5: math_sci_car_cor -0.21986258
## 6: reading_sci_car_cor -0.20404487
## 7: math_env_cor 0.12821305
## 8: reading_joy_cor 0.05982742
## 9: math_self_eff_cor 0.02318556
## 10: reading_self_eff_cor 0.04272108
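melt()'s wide-to-long reshaping can be seen on a tiny table; a self-contained sketch with made-up values:

```r
library(data.table)

wide <- data.table(id = c(1, 2), math = c(490, 510), reading = c(493, 505))

# each measured column becomes (variable, value) pairs, one row per cell
melt(wide, id.vars = "id",
     measure.vars = c("math", "reading"),
     variable.name = "subject", value.name = "score")
```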
random6[, .(Mean = rowMeans(.SD)), by = CNTRYID, .SDcols = c("JOYSCIE", "INTBRSCI")]
## CNTRYID Mean
## 1: Germany -0.68550
## 2: Germany -1.62215
## 3: Germany -0.97025
## 4: Germany 1.03440
## 5: Germany 0.06700
## ---
## 35843: Uruguay 1.16175
## 35844: Uruguay 0.22755
## 35845: Uruguay -1.29245
## 35846: Uruguay -0.31095
## 35847: Uruguay -0.54705
# create column subset to convert to numeric
cols_to_convert <- c("MISCED", "FISCED")
# convert to numeric with subset index and using lapply()
random6[ ,
(cols_to_convert) := lapply(.SD, as.numeric),
.SDcols = cols_to_convert]
## Warning in lapply(.SD, as.numeric): NAs introduced by coercion
## Warning in lapply(.SD, as.numeric): NAs introduced by coercion
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
random6[, lapply(.SD, mean), .SDcols = c("DISCLISCI", "TEACHSUP", "IBTEACH", "TDTEACH")] # returns NA for each column because they contain missing values; lapply(.SD, mean, na.rm = TRUE) would return the means
## DISCLISCI TEACHSUP IBTEACH TDTEACH
## 1: NA NA NA NA
# run descriptive statistics for three additional variables hypothesized to relate to PA032Q03TA
descriptive_stats <- random6[ ,lapply(.SD, psych::describe),
.SDcols = c("DISCLISCI", "TEACHSUP","TDTEACH")]
# reshape table into long format so that it is easier to read the output
melt(descriptive_stats)
## Warning in melt.data.table(descriptive_stats): id.vars and measure.vars are
## internally guessed when both are 'NULL'. All non-numeric/integer/logical type
## columns are considered id.vars, which in this case are columns []. Consider
## providing at least one of 'id' or 'measure' vars in future.
## variable value
## 1: DISCLISCI.vars 1.000000e+00
## 2: DISCLISCI.n 3.140600e+04
## 3: DISCLISCI.mean 1.445539e-01
## 4: DISCLISCI.sd 1.002982e+00
## 5: DISCLISCI.median 3.900000e-03
## 6: DISCLISCI.trimmed 1.504836e-01
## 7: DISCLISCI.mad 1.032186e+00
## 8: DISCLISCI.min -2.416200e+00
## 9: DISCLISCI.max 1.883700e+00
## 10: DISCLISCI.range 4.299900e+00
## 11: DISCLISCI.skew -1.627706e-01
## 12: DISCLISCI.kurtosis -2.219641e-01
## 13: DISCLISCI.se 5.659616e-03
## 14: TEACHSUP.vars 1.000000e+00
## 15: TEACHSUP.n 3.062000e+04
## 16: TEACHSUP.mean 1.259763e-01
## 17: TEACHSUP.sd 9.746136e-01
## 18: TEACHSUP.median 4.540000e-02
## 19: TEACHSUP.trimmed 1.846121e-01
## 20: TEACHSUP.mad 9.899320e-01
## 21: TEACHSUP.min -2.719500e+00
## 22: TEACHSUP.max 1.447500e+00
## 23: TEACHSUP.range 4.167000e+00
## 24: TEACHSUP.skew -3.721514e-01
## 25: TEACHSUP.kurtosis -1.845543e-01
## 26: TEACHSUP.se 5.569675e-03
## 27: TDTEACH.vars 1.000000e+00
## 28: TDTEACH.n 3.012000e+04
## 29: TDTEACH.mean -2.597529e-02
## 30: TDTEACH.sd 9.603662e-01
## 31: TDTEACH.median -1.170000e-02
## 32: TDTEACH.trimmed -3.303284e-02
## 33: TDTEACH.mad 8.873361e-01
## 34: TDTEACH.min -2.447600e+00
## 35: TDTEACH.max 2.078100e+00
## 36: TDTEACH.range 4.525700e+00
## 37: TDTEACH.skew -7.223483e-03
## 38: TDTEACH.kurtosis 3.790284e-01
## 39: TDTEACH.se 5.533621e-03
## variable value